{insert event} 2024
Monash University, Australia
Professor Dianne Cook, Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia
Dr. Emi Tanaka, Biological Data Science Institute, Australian National University, Canberra, Australia
Assistant Professor Susan VanderPlas, Statistics Department, University of Nebraska, Lincoln, USA
Graphical approaches (plots) are the recommended methods for diagnosing residuals.
Residual plots are usually revealing when the assumptions are violated.
Graphical methods are easier to use than conventional hypothesis tests.
Residual plots are more informative in most practical situations than the corresponding conventional hypothesis tests.
What do you observe from this residual plot?
However, this is an over-interpretation.
The fitted model is correctly specified!
The triangle shape is caused by the skewed distribution of the regressors.
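This can be checked in simulation: with a skewed regressor and a correctly specified linear model, the residual spread stays constant across the fitted values, so any apparent triangle in the plot is a point-density artifact. A minimal sketch (the exponential regressor and all parameter values here are illustrative assumptions, not the talk's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Skewed regressor, correctly specified linear model
x = rng.exponential(scale=1.0, size=n)
y = 1.0 + x + rng.normal(0.0, 1.0, size=n)

# Fit the (correct) model by least squares
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Residual spread is constant across fitted-value quantile bins,
# so a "triangle" in the residual plot reflects density alone.
edges = np.quantile(fitted, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
bins = np.clip(np.searchsorted(edges, fitted, side="right") - 1, 0, 4)
bin_sds = [resid[bins == k].std() for k in range(5)]
print([round(s, 2) for s in bin_sds])  # all close to 1
```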
The reading of residual plots can be calibrated by an inferential framework called visual inference (Buja et al., 2009).
Typically, a lineup of residual plots consists of the true residual plot randomly placed among null plots generated consistently with the fitted model.
To perform a visual test, observers are asked to pick the plot that looks most different from the others; if the true residual plot is identified, the null hypothesis is rejected.
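Under this protocol, a visual p-value can be computed from a binomial tail: the chance that at least the observed number of observers pick the true plot by guessing alone. A simplified sketch (the lineup size m = 20 and the independent-uniform-guessing assumption are illustrative; the study's p-value calculation is more refined):

```python
from math import comb

def visual_p_value(detections: int, observers: int, m: int = 20) -> float:
    """P(at least `detections` of `observers` pick the true plot by chance),
    assuming independent observers guessing uniformly among m plots."""
    p = 1.0 / m
    return sum(
        comb(observers, k) * p**k * (1 - p) ** (observers - k)
        for k in range(detections, observers + 1)
    )

print(round(visual_p_value(1, 1), 3))  # single observer: 1/20 = 0.05
print(round(visual_p_value(3, 5), 4))
```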
To understand why regression experts consistently recommend plotting residuals for regression diagnostics, we conducted an experiment to compare conventional hypothesis testing with visual testing.
\[\boldsymbol{y} = \boldsymbol{1}_n + \boldsymbol{x} + \boldsymbol{z} + \boldsymbol{\varepsilon},~ \boldsymbol{z} \propto He_j(\boldsymbol{x}) \text{ and } \boldsymbol{\varepsilon} \sim N(\boldsymbol{0}_n, \sigma^2\boldsymbol{I}_n),\]
where \(\boldsymbol{y}\), \(\boldsymbol{x}\), \(\boldsymbol{z}\), \(\boldsymbol{\varepsilon}\) are vectors of size \(n\), \(\boldsymbol{1}_n\) is a vector of ones of size \(n\), and \(He_{j}(.)\) is the \(j\)th-order probabilist's Hermite polynomial.
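The non-linearity simulation model can be sketched as follows. The uniform regressor, the scaling of \(\boldsymbol{z}\), and the parameter values are illustrative assumptions; NumPy's `hermite_e` module implements the probabilists' Hermite basis used here:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

rng = np.random.default_rng(1)
n, j, sigma = 300, 3, 0.5

x = rng.uniform(-1, 1, size=n)
# z proportional to the j-th probabilists' Hermite polynomial He_j(x)
he_j = hermeval(x, [0.0] * j + [1.0])   # coefficient vector selects He_j
z = he_j / np.abs(he_j).max()           # one choice of scaling (assumption)
eps = rng.normal(0.0, sigma, size=n)
y = 1.0 + x + z + eps                   # y = 1_n + x + z + eps
```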
\[\boldsymbol{y} = \beta_0 + \beta_1\boldsymbol{x} + \boldsymbol{u}, ~\boldsymbol{u} \sim N(\boldsymbol{0}_n, \sigma^2\boldsymbol{I}_n).\]
\[\boldsymbol{y} = \boldsymbol{1}_n + \boldsymbol{x} + \boldsymbol{\varepsilon},~ \boldsymbol{\varepsilon} \sim N\left(\boldsymbol{0}_n, \left(1 + (2 - |a|)(\boldsymbol{x} - a)^2b\right)\boldsymbol{I}_n\right),\]
where \(\boldsymbol{y}\), \(\boldsymbol{x}\), \(\boldsymbol{\varepsilon}\) are vectors of size \(n\), and \(\boldsymbol{1}_n\) is a vector of ones of size \(n\).
\[\boldsymbol{y} = \beta_0 + \beta_1\boldsymbol{x} + \boldsymbol{u}, ~\boldsymbol{u} \sim N(\boldsymbol{0}_n, \sigma^2\boldsymbol{I}_n).\]
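In both scenarios the null model fitted to the data is this simple linear regression. A sketch of generating the heteroskedastic data and extracting OLS residuals (the uniform regressor and the values of \(a\) and \(b\) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, a, b = 300, 0.0, 2.0

x = rng.uniform(-1, 1, size=n)
# Heteroskedastic errors: variance 1 + (2 - |a|) * (x - a)^2 * b
sd = np.sqrt(1.0 + (2.0 - abs(a)) * (x - a) ** 2 * b)
y = 1.0 + x + rng.normal(0.0, sd)

# Fit the null model y = b0 + b1 * x by OLS and take residuals
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
```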
We use logistic regression to estimate the power:
\[Pr(\text{reject}~H_0|H_1,E) = \Lambda\left(\log\left(\frac{0.05}{0.95}\right) + \beta_1 E\right),\]
where \(\Lambda(.)\) is the standard logistic function, \(\Lambda(z) = \exp(z)/(1+\exp(z))\).
The effect size \(E\), the only predictor, is calculated using the Kullback–Leibler divergence (Kullback and Leibler, 1951).
The intercept is fixed at \(\log(0.05/0.95)\) so that \(\hat{Pr}(\text{reject}~H_0|H_1,E = 0) = 0.05\).
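A sketch of the two ingredients of this power model. The univariate-normal KL divergence below is shown only for illustration; the study computes the divergence between the actual residual distributions, and \(\beta_1 = 2\) is an arbitrary value, not an estimate:

```python
import numpy as np

def kl_normal(mu1, sd1, mu0, sd0):
    """KL divergence KL(N(mu1, sd1^2) || N(mu0, sd0^2)) for univariate normals."""
    return (np.log(sd0 / sd1)
            + (sd1**2 + (mu1 - mu0) ** 2) / (2 * sd0**2) - 0.5)

def power(E, beta1, alpha=0.05):
    """Pr(reject H0 | H1, E) with the intercept fixed at log(alpha/(1-alpha))."""
    z = np.log(alpha / (1 - alpha)) + beta1 * E
    return 1.0 / (1.0 + np.exp(-z))

print(round(power(0.0, beta1=2.0), 3))  # 0.05 by construction
```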
Subjects were recruited via Prolific (Palan and Schitter, 2018):
Every subject was asked to:
The visual test rejects less frequently than the conventional test, and (almost) only rejects when the conventional test does.
Data plot (No.1):
The RESET test rejects (\(p\text{-value} = 0.004\)).
The visual test yields a more practical \(p\text{-value}\) of \(0.813\).
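The RESET statistic can be reproduced from scratch as an F-test on powers of the fitted values (standard Ramsey form with squared and cubed terms). The simulated data below are illustrative, not the study's, so the p-value will differ from 0.004:

```python
import numpy as np
from scipy import stats

def reset_test(y, x, powers=(2, 3)):
    """Ramsey RESET: F-test for powers of fitted values added to y ~ 1 + x."""
    n = len(y)
    X0 = np.column_stack([np.ones(n), x])
    b0, *_ = np.linalg.lstsq(X0, y, rcond=None)
    fitted = X0 @ b0
    rss0 = ((y - fitted) ** 2).sum()

    # Augment the design with fitted^2, fitted^3 and refit
    X1 = np.column_stack([X0] + [fitted**p for p in powers])
    b1, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss1 = ((y - X1 @ b1) ** 2).sum()

    q = len(powers)                 # number of restrictions tested
    df2 = n - X1.shape[1]
    F = ((rss0 - rss1) / q) / (rss1 / df2)
    return F, stats.f.sf(F, q, df2)

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 200)
y = 1 + x + 0.3 * x**2 + rng.normal(0, 0.3, 200)  # mild non-linearity
F, p = reset_test(y, x)
```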
Conventional tests are more sensitive to weak departures.
Conventional tests often reject when departures are not visibly different from null residual plots.
In these cases, visual tests provide a more practical solution.
Regression experts are right. Residual plots are an indispensable tool for assessing model fit.
Slides URL: https://patrickli-dec11talk.netlify.app